Abstract:
To efficiently deploy state-of-the-art deep neural network (DNN) workloads with growing computational intensity and structural complexity, scalable DNN accelerators have been proposed in recent years, featuring multiple tensor engines and distributed on-chip buffers. Such spatial architectures significantly expand the scheduling space in terms of parallelism and data reuse potential, which demands delicate workload orchestration. Previous work on the DNN hardware mapping problem mainly focuses on operator-level loop transformation for a single array, which is insufficient for this new challenge. Resource partitioning methods for multiple engines, such as CNN-partition and inter-layer pipelining, have been studied. However, their intrinsic disadvantages of workload imbalance and pipeline delay still prevent scalable accelerators from realizing their full potential. In this paper, we propose atomic dataflow, a novel graph-level scheduling and mapping approach developed for DNN inference. Instead of partitioning hardware resources into fixed regions and binding each DNN layer to a certain region sequentially, atomic dataflow schedules the DNN computation graph at workload-specific granularity (atoms) to ensure PE-array utilization, supports flexible atom ordering to exploit parallelism, and orchestrates atom-engine mapping to optimize data reuse between spatially connected tensor engines. First, we propose a simulated-annealing-based atomic tensor generation algorithm to minimize load imbalance. Second, we develop a dynamic-programming-based atomic DAG scheduling algorithm to systematically explore the massive ordering space. Finally, to facilitate data locality and reduce expensive off-chip memory accesses, we present mapping and buffering strategies that efficiently utilize distributed on-chip storage. With an automated optimization framework established, experimental results show significant improvements over baseline approaches in terms of performance, hardware utilization, and energy consumption.
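The abstract gives no details of the annealing formulation; the toy sketch below (hypothetical cost function and move, not the paper's algorithm) only shows the general shape of simulated-annealing-based load balancing: atoms with known costs are re-homed across tensor engines while the load imbalance is annealed down.

```python
import math
import random

def anneal_atom_assignment(atom_costs, num_engines,
                           t0=1.0, t_min=1e-3, alpha=0.95, iters=200):
    """Toy simulated annealing: assign atoms (work items with known cost)
    to tensor engines so that load imbalance is minimized.
    Illustrative cost model and move, not the paper's algorithm."""
    assign = [random.randrange(num_engines) for _ in atom_costs]

    def imbalance(a):
        loads = [0.0] * num_engines
        for cost, eng in zip(atom_costs, a):
            loads[eng] += cost
        return max(loads) - min(loads)  # objective: spread work evenly

    cur = imbalance(assign)
    t = t0
    while t > t_min:
        for _ in range(iters):
            i = random.randrange(len(assign))   # move: re-home one atom
            old = assign[i]
            assign[i] = random.randrange(num_engines)
            new = imbalance(assign)
            # Accept improvements always, worsenings with Boltzmann probability.
            if new <= cur or random.random() < math.exp((cur - new) / t):
                cur = new
            else:
                assign[i] = old                 # reject: restore old home
        t *= alpha                              # cool down
    return assign, cur
```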
Abstract:
Monolithic SoCs can be decomposed into disparate chiplets that are integrated with advanced packaging technologies. This concept is promising for reducing the manufacturing cost of large-scale SoCs due to the higher yield rate and reusability of chiplets. The chiplets should be designed in a modular manner, without holistic system knowledge, so that they can be reused in different SoCs. However, this design modularity is a major challenge for the networks-on-chip (NoCs) of chiplets. New deadlocks may occur across both the chiplets and the interposer due to the integration, even if the NoC of each individually designed chiplet is deadlock-free. However, conventional deadlock-freedom approaches are unsuitable for handling such deadlocks because they require holistic knowledge and violate modularity. Although several modular approaches specifically target integration-induced deadlocks, their routing is overly restricted and their injection control incurs additional latency. They also lack flexibility under dynamically changing topologies due to their complex software algorithms and hard-wired components. In this paper, a key insight into chiplet integration-induced deadlocks is gained, inspired by which a deadlock recovery framework (named UPP) is proposed. Specifically, it is verified that an integration-induced deadlock always involves a stalled upward packet moving from the interposer to the connected chiplet via a vertical link. Thus, UPP detects a deadlock by discovering the upward packet and recovers the system from deadlock by transmitting the upward packet to its destination. Hybrid flow control mechanisms are proposed to enable the upward packet to bypass the buffers and be transmitted via the normal router datapath. To guarantee the ejection of the upward packet after transmission, a lightweight protocol is proposed to reserve ejection queue entries of the network interface. Experimental results show that, while adhering to design modularity, UPP provides an average runtime speedup of 3.1%–10.3% with an area overhead of less than 4%.
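UPP's detection hardware is not specified in the abstract. Purely as a functional illustration of the stated principle, the sketch below (hypothetical names and threshold) watches the chiplet-side end of a vertical link for an upward packet that has stalled beyond a timeout and hands it to the recovery datapath.

```python
from dataclasses import dataclass

STALL_THRESHOLD = 64  # hypothetical detection timeout, in cycles

@dataclass
class UpwardPort:
    """Chiplet-side endpoint of a vertical link from the interposer."""
    head_packet: object = None  # upward packet at the head of the link
    stall_cycles: int = 0

    def tick(self, moved: bool):
        """Call once per cycle; 'moved' is True if the head packet advanced.
        Returns a packet flagged for deadlock recovery, or None."""
        if self.head_packet is None or moved:
            self.stall_cycles = 0       # progress observed: no deadlock
            return None
        self.stall_cycles += 1
        if self.stall_cycles >= STALL_THRESHOLD:
            # Candidate integration-induced deadlock: recover by letting
            # the upward packet bypass buffers toward its destination.
            self.stall_cycles = 0
            return self.head_packet
        return None
```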
Abstract:
Although convolutional neural network (CNN) models have greatly advanced many fields, the enormous number of parameters and computations in these models poses significant performance and energy challenges for hardware implementations. Transferred-filter-based methods, very promising techniques that have not yet been explored in the architecture domain, can substantially compress CNN models. However, their straightforward hardware implementation inherently incurs massive redundant computations, causing significant energy and time consumption. In this work, a highly efficient transferred-filter-based engine (TFE) is developed to alleviate this deficiency, compressing and accelerating CNN models. First, the filters of CNN models are flexibly transferred according to specific tasks to reduce the model size. Then, two hardware-friendly mechanisms are proposed in the TFE to remove the duplicate computations caused by transferred filters, further accelerating transferred CNN models. The first mechanism exploits the shared weights hidden in each row of transferred filters and reuses the corresponding identical partial sums, eliminating at least 25% of the repetitive computations in each row. The second mechanism intelligently schedules and accesses the memory system to reuse the repetitive partial sums among different rows of the transferred filters, with at least 25% of computations eliminated. Furthermore, an efficient hardware architecture is proposed in the TFE to fully reap the benefits of the two mechanisms, so that different types of networks are flexibly supported. To achieve high energy efficiency, a sub-array-based filter mapping method (SAFM) is proposed, in which the processing element (PE) sub-array is used as the elementary computational unit to support various filters. Input data can thereby be efficiently broadcast within each PE sub-array, and the per-PE load is substantially reduced, which dramatically lowers area and power consumption. Excluding MobileNet-like networks that adopt depth-wise convolution, most mainstream networks can be compressed and accelerated by the proposed TFE. Two state-of-the-art transferred-filter-based methods, i.e., doubly CNN and symmetry CNN, are implemented with the TFE. Compared with Eyeriss, average speedups of 2.93× and 3.17× are achieved on the convolutional layers of various modern CNNs, and the overall energy efficiency is improved by 12.66× and 13.31× on average. Compared with other state-of-the-art related works, the TFE achieves up to a 4.0× parameter reduction, a 2.72× speedup, and a 10.74× energy efficiency improvement on VGGNet.
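The reuse mechanism itself is not detailed in the abstract. As a toy 1-D illustration of why transferred filters create reusable partial sums, the sketch below assumes, for illustration only, that the transfer is a horizontal flip: each elementary product of the original filter row is cached once, so the transferred row costs no additional multiplications.

```python
def row_conv_with_reuse(x, w):
    """1-D valid sliding dot product of input row x with filter row w and
    with its horizontal flip, sharing every elementary product x[m]*w[n].
    Toy illustration of partial-sum reuse for transferred filters."""
    K, N = len(w), len(x)
    # Compute each elementary product exactly once.
    prod = {(m, n): x[m] * w[n] for m in range(N) for n in range(K)}
    # Original filter: out[i] = sum_j x[i+j] * w[j].
    out = [sum(prod[(i + j, j)] for j in range(K))
           for i in range(N - K + 1)]
    # Flipped filter w[::-1] reuses the cached products: its tap j at
    # window i needs x[i+j] * w[K-1-j], which is already in 'prod'.
    out_flip = [sum(prod[(i + j, K - 1 - j)] for j in range(K))
                for i in range(N - K + 1)]
    return out, out_flip
```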
Abstract:
Near-Memory Processing (NMP) systems that integrate accelerators within DIMM (Dual-Inline Memory Module) buffer chips potentially provide high performance with relatively low design and manufacturing costs. However, an inevitable communication bottleneck arises on the main memory bus among peer DIMMs and the host CPU. This bottleneck is rooted in the bus-based nature and the limited point-to-point communication pattern of the main memory system. The aggregated memory bandwidth of DIMM-based NMP scales with the number of DIMMs; when the number of DIMMs in a channel scales up, the per-DIMM point-to-point communication bandwidth scales down, whereas the computation resources and local memory bandwidth per DIMM stay the same. For many important sparse, data-intensive workloads such as graph applications and sparse tensor algebra, we identify that communication among DIMMs and the host CPU easily dominates the processing in previous DIMM-based NMP systems, severely bottlenecking their performance. To tackle this challenge, we propose that inter-DIMM broadcast should be implemented and utilized in the main memory system of DIMM-based NMP. On the hardware side, the main memory bus naturally scales out with broadcast: the per-DIMM effective broadcast bandwidth remains the same as the number of DIMMs grows. On the software side, many sparse applications can be implemented in a form in which broadcasts dominate their communication. Based on these ideas, we design ABC-DIMM, which Alleviates the Bottleneck of Communication in DIMM-based NMP, consisting of integral broadcast mechanisms and a Broadcast-Process programming framework, with minimal modifications to the commodity software-hardware stack. Our evaluation shows that ABC-DIMM offers an 8.33× geometric-mean speedup over a 16-core CPU baseline and outperforms two NMP baselines by 2.59× and 2.93× on average.
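To make the scaling argument concrete, here is the back-of-the-envelope arithmetic with an assumed channel bandwidth (illustrative numbers, not from the paper): point-to-point transfers time-share the bus, so per-DIMM bandwidth shrinks as 1/N, while one broadcast transaction reaches all N DIMMs at once.

```python
CHANNEL_BW_GBS = 25.6  # assumed DDR4-3200 channel bandwidth, illustrative

def per_dimm_bandwidth(n_dimms):
    """Effective per-DIMM communication bandwidth on one shared channel."""
    p2p = CHANNEL_BW_GBS / n_dimms  # bus is time-shared among pairwise transfers
    broadcast = CHANNEL_BW_GBS      # one bus transaction reaches every DIMM
    return p2p, broadcast

for n in (2, 4, 8, 16):
    p2p, bc = per_dimm_bandwidth(n)
    print(f"{n:2d} DIMMs: point-to-point {p2p:5.2f} GB/s/DIMM, "
          f"broadcast {bc:5.2f} GB/s/DIMM")
```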
Abstract:
Bit-serial computation has been a prevailing convolution method for accelerating varying-precision DNNs: it slices multi-bit data into multiple 1-bit data and transforms a multiplication into multiple additions, where additions of zero bits are ineffectual, while additions of non-zero bits are repetitive, since multiple kernels are likely to possess non-zero bits at the same kernel positions. Previous bit-serial accelerators only remove ineffectual additions by skipping the computation of zero bits; repetitive additions cannot be eliminated because these accelerators compute the convolution of each kernel independently. In this work, we propose the fused kernel convolution algorithm, which eliminates both ineffectual and repetitive additions in bit-serial computation by exploiting bit repetition and bit sparsity in weights, for both convolutional and fully-connected layers. It unifies the convolutions of multiple kernels into the convolution of one fused kernel by first grouping additions into different patterns and then reconstructing the convolution results, minimizing the addition count. Meanwhile, memory accesses of activations and partial sums decrease due to the reduced convolution count. A fused-kernel-convolution-based accelerator, FuseKNA, is then designed with compact compute logic, fully exploiting the value sparsity of activations and the bit sparsity of weights. Benchmarked on a set of mainstream DNNs, FuseKNA improves performance by $4.47 \times$, $2.31 \times$ and $1.81 \times$, and energy efficiency by $4.13 \times$, $3.06 \times$ and $2.53 \times$, over the state-of-the-art Stripes, Pragmatic and Bit-Tactical.
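The grouping and reconstruction details are beyond the abstract. The toy model below (unsigned weights, dot products rather than full convolution, not FuseKNA's actual datapath) only illustrates the two effects named above: zero bits are skipped, and a non-zero bit shared by several kernels at the same (position, bit-plane) is accumulated once into a pattern sum from which each kernel's result is reconstructed.

```python
from collections import defaultdict

def fused_bit_serial_dot(x, kernels, bits=8):
    """Toy fused-kernel bit-serial dot product with unsigned weights.
    Additions sharing the same 'pattern' (set of kernels with a non-zero
    bit at one (position, bit-plane)) are performed once; per-kernel
    results are then reconstructed from the pattern sums."""
    pattern_sum = defaultdict(int)
    fused_adds = 0
    for p in range(len(x)):
        for b in range(bits):
            users = frozenset(k for k, w in enumerate(kernels)
                              if (w[p] >> b) & 1)   # zero bits are skipped
            if users:
                pattern_sum[users] += x[p] << b      # one shared addition
                fused_adds += 1
    # Reconstruction: each kernel sums the patterns it participates in.
    acc = [sum(s for pat, s in pattern_sum.items() if k in pat)
           for k in range(len(kernels))]
    # Baseline: one addition per non-zero weight bit per kernel.
    naive_adds = sum(bin(w).count("1") for ws in kernels for w in ws)
    return acc, naive_adds, fused_adds

# Example: two 3-tap kernels sharing non-zero bits at the same positions.
acc, naive, fused = fused_bit_serial_dot([1, 2, 3], [[3, 5, 0], [3, 1, 4]])
print(acc, naive, fused)   # acc == [13, 17], fused < naive
```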
Abstract:
TCAM (Ternary Content-Addressable Memory) is the essential component for high-speed packet classification in modern hardware switches. However, due to its relatively slow update process, recent advances in Software-Defined Networking (SDN) regard it as the bottleneck to the agile deployment of network services. Rule installation in commodity switches suffers from non-deterministic delays, ranging from a few milliseconds to nearly half a second. The crux of the problem is that TCAM prioritizes rules by physical address: existing entries have to be reallocated according to the priority of an incoming rule, so the insertion delay grows linearly with the number of existing rules in a TCAM. In this paper, we present Constant-time Alteration Ternary CAM (CATCAM), which can accomplish both lookup queries and update requests for packet classification in a few nanoseconds. The key to fast updates is to decouple rule priorities from physical addresses. We propose a matrix-based priority encoding scheme that records the priority relation between rules and can be implemented in 8T SRAM arrays with the emerging Processing In-Memory (PIM) technique. CATCAM also comes with a hierarchical architecture to scale out, and its interval-based scheduling scheme guarantees deterministic update performance in all scenarios. CATCAM is developed as a full-custom design in a 28 nm process. Evaluation across benchmark workloads shows that CATCAM provides at least three orders of magnitude speedup over state-of-the-art TCAM update algorithms and offers search capability equivalent to conventional TCAM while incurring 0.3% power and 20% area overheads.
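The abstract only names the matrix-based priority encoding; the software model below (hypothetical API, not CATCAM's circuit) captures its key property: a boolean matrix records, for every pair of rule slots, which one has higher priority, so inserting a rule writes one row and one column instead of relocating entries, and a multi-match lookup returns the matched slot that dominates all other matched slots.

```python
class PriorityMatrix:
    """Software model of matrix-based priority encoding. higher[i][j] is
    True iff the rule in slot i has priority over the rule in slot j.
    Assumes distinct priorities. In hardware the row/column update is a
    parallel write into the 8T SRAM array (constant time); this Python
    loop is only a functional stand-in."""
    def __init__(self, capacity):
        self.higher = [[False] * capacity for _ in range(capacity)]
        self.free = list(range(capacity))
        self.prio = {}                       # slot -> numeric priority

    def insert(self, priority):
        """Place a rule in any free slot; no entries are relocated."""
        slot = self.free.pop()
        for other, p in self.prio.items():   # one row and one column
            self.higher[slot][other] = priority > p
            self.higher[other][slot] = p > priority
        self.prio[slot] = priority
        return slot

    def winner(self, matched_slots):
        """Return the matched slot that dominates all other matches."""
        for s in matched_slots:
            if all(self.higher[s][t] for t in matched_slots if t != s):
                return s
        return None
```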